
    Poor starting points in machine learning

    Poor (even random) starting points for learning/training/optimization are common in machine learning. In many settings, the method of Robbins and Monro (online stochastic gradient descent) is known to be optimal for good starting points, but may not be optimal for poor starting points -- indeed, for poor starting points Nesterov acceleration can help during the initial iterations, even though Nesterov methods not designed for stochastic approximation could hurt during later iterations. The common practice of training with nontrivial minibatches enhances the advantage of Nesterov acceleration.
    Comment: 11 pages, 3 figures, 1 table; this initial version is literally identical to that circulated among a restricted audience over a month ago
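    For orientation, here is a minimal NumPy sketch contrasting a plain Robbins-Monro (stochastic gradient) update with a Nesterov-accelerated update on minibatch least squares, started deliberately far from the solution. The step size, momentum constant, and problem setup are illustrative assumptions, not the paper's experiments.

```python
import numpy as np

def sgd_step(w, grad, lr):
    # Plain Robbins-Monro stochastic gradient step.
    return w - lr * grad

def nesterov_step(w, v, grad_at_lookahead, lr, momentum=0.9):
    # Nesterov-accelerated step: the gradient is evaluated at the
    # "look-ahead" point w + momentum * v.
    v_new = momentum * v - lr * grad_at_lookahead
    return w + v_new, v_new

rng = np.random.default_rng(0)
n, d = 10_000, 20
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

w_sgd = np.zeros(d) + 10.0          # deliberately poor starting point
w_nag, v = w_sgd.copy(), np.zeros(d)
lr, batch = 0.01, 64
for it in range(500):
    idx = rng.integers(0, n, size=batch)
    Xb, yb = X[idx], y[idx]
    # Minibatch gradient of the least-squares loss at the current iterate.
    g_sgd = Xb.T @ (Xb @ w_sgd - yb) / batch
    w_sgd = sgd_step(w_sgd, g_sgd, lr)
    # Minibatch gradient at the look-ahead point for the Nesterov update.
    look = w_nag + 0.9 * v
    g_nag = Xb.T @ (Xb @ look - yb) / batch
    w_nag, v = nesterov_step(w_nag, v, g_nag, lr)

print("SGD error:     ", np.linalg.norm(w_sgd - w_true))
print("Nesterov error:", np.linalg.norm(w_nag - w_true))
```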

    Testing the significance of assuming homogeneity in contingency-tables/cross-tabulations

    The model for homogeneity of proportions in a two-way contingency-table/cross-tabulation is the same as the model of independence, except that the probabilistic process generating the data is viewed as fixing the column totals (but not the row totals). When gauging the consistency of observed data with the assumption of independence, recent work has illustrated that the Euclidean/Frobenius/Hilbert-Schmidt distance is often far more statistically powerful than the classical statistics such as chi-square, the log-likelihood-ratio (G), the Freeman-Tukey/Hellinger distance, and other members of the Cressie-Read power-divergence family. The present paper indicates that the Euclidean/Frobenius/Hilbert-Schmidt distance can be more powerful for gauging the consistency of observed data with the assumption of homogeneity, too.
    Comment: 14 pages, 18 tables
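    As a rough illustration (not the paper's procedure), the sketch below computes both the classical chi-square statistic and the Euclidean/Frobenius distance between observed and expected counts for a two-way table. Significance for the Euclidean statistic would typically be calibrated by Monte Carlo simulation under the null of homogeneity, which is omitted here.

```python
import numpy as np

def homogeneity_statistics(table):
    """Chi-square and Euclidean statistics for a two-way table of counts.
    The expected counts are the usual row-total * column-total / grand-total,
    which serve for both the independence and homogeneity models."""
    table = np.asarray(table, dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    total = table.sum()
    expected = row * col / total
    chi2 = ((table - expected) ** 2 / expected).sum()
    euclid = np.sqrt(((table - expected) ** 2).sum())  # Frobenius distance
    return chi2, euclid

observed = [[12, 7, 9],
            [ 8, 5, 14]]
chi2, euclid = homogeneity_statistics(observed)
print(f"chi-square = {chi2:.3f}, Euclidean distance = {euclid:.3f}")
```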

    A fast algorithm for computing minimal-norm solutions to underdetermined systems of linear equations

    We introduce a randomized algorithm for computing the minimal-norm solution to an underdetermined system of linear equations. Given an arbitrary full-rank m x n matrix A with m<n, any m x 1 vector b, and any positive real number epsilon less than 1, the procedure computes an n x 1 vector x approximating to relative precision epsilon or better the n x 1 vector p of minimal Euclidean norm satisfying Ap=b. The algorithm typically requires O(mn log(sqrt(n)/epsilon) + m**3) floating-point operations, generally less than the O(m**2 n) required by the classical schemes based on QR-decompositions or bidiagonalization. We present several numerical examples illustrating the performance of the algorithm.
    Comment: 13 pages, 4 tables
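    For reference, the classical minimal-norm solution that the randomized algorithm approximates can be computed densely with NumPy's least-squares routine, which returns the minimal Euclidean-norm solution for underdetermined systems; this baseline costs on the order of m**2 n operations, whereas the paper's randomized scheme is typically cheaper. The sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 500                      # underdetermined: m < n
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Classical reference: lstsq returns the minimal Euclidean-norm solution
# of the underdetermined system A p = b.
p, *_ = np.linalg.lstsq(A, b, rcond=None)

print("residual ||Ap - b|| =", np.linalg.norm(A @ p - b))
print("solution norm ||p|| =", np.linalg.norm(p))
```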

    Regression-aware decompositions

    Linear least-squares regression with a "design" matrix A approximates a given matrix B via minimization of the spectral- or Frobenius-norm discrepancy ||AX-B|| over every conformingly sized matrix X. Another popular approximation is low-rank approximation via principal component analysis (PCA) -- which is essentially singular value decomposition (SVD) -- or interpolative decomposition (ID). Classically, PCA/SVD and ID operate solely on the matrix B being approximated, unsupervised by any auxiliary matrix A. However, linear least-squares regression models can inform the ID, yielding a regression-aware ID. As a bonus, this provides an interpretation as regression-aware PCA for a kind of canonical correlation analysis between A and B. The regression-aware decompositions effectively enable supervision to inform classical dimensionality reduction, which classically has been totally unsupervised. The regression-aware decompositions reveal the structure inherent in B that is relevant to regression against A.
    Comment: 19 pages, 9 figures, 2 tables
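    One plausible way to read "regression-aware" low-rank approximation (an illustrative interpretation only, not necessarily the paper's exact construction) is to restrict attention to the part of B explained by least-squares regression on A and decompose that part, so that the retained components are those relevant to regression against A.

```python
import numpy as np

rng = np.random.default_rng(2)
m, p, q = 200, 10, 30
A = rng.standard_normal((m, p))
B = rng.standard_normal((m, q))

# Least-squares fit: X minimizes ||A X - B||_F over all p x q matrices X.
X, *_ = np.linalg.lstsq(A, B, rcond=None)
fit = A @ X                          # the part of B explained by regression on A

# SVD of the fitted part, keeping the k leading components -- one plausible
# reading of a "regression-aware" low-rank decomposition of B.
k = 3
U, s, Vt = np.linalg.svd(fit, full_matrices=False)
B_aware = U[:, :k] * s[:k] @ Vt[:k]

print("relative rank-k error on the fitted part:",
      np.linalg.norm(fit - B_aware) / np.linalg.norm(fit))
```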

    Recurrence relations and fast algorithms

    We construct fast algorithms for evaluating transforms associated with families of functions which satisfy recurrence relations. These include algorithms both for computing the coefficients in linear combinations of the functions, given the values of these linear combinations at certain points, and, vice versa, for evaluating such linear combinations at those points, given the coefficients in the linear combinations; such procedures are also known as analysis and synthesis of series of certain special functions. The algorithms of the present paper are efficient in the sense that their computational costs are proportional to n (ln n) (ln(1/epsilon))^3, where n is the amount of input and output data, and epsilon is the precision of computations. Stated somewhat more precisely, we find a positive real number C such that, for any positive integer n > 10, the algorithms require at most C n (ln n) (ln(1/epsilon))^3 floating-point operations and words of memory to evaluate at n appropriately chosen points any linear combination of n special functions, given the coefficients in the linear combination, where epsilon is the precision of computations.
    Comment: 24 pages
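    The paper's fast algorithms are too involved for a short sketch, but the following Clenshaw-style evaluation of a Chebyshev series makes concrete what "synthesis" means here: evaluating a linear combination of functions satisfying a three-term recurrence, given its coefficients. This direct approach costs O(n) per point, hence O(n^2) for n points, which is what the paper's O(n (ln n) (ln(1/epsilon))^3) algorithms improve upon.

```python
import numpy as np

def clenshaw_chebyshev(coeffs, x):
    """Evaluate sum_k coeffs[k] * T_k(x) via the Clenshaw recurrence,
    exploiting the three-term recurrence T_{k+1}(x) = 2 x T_k(x) - T_{k-1}(x)."""
    b1 = np.zeros_like(x)
    b2 = np.zeros_like(x)
    for c in coeffs[:0:-1]:            # coeffs[n-1], ..., coeffs[1]
        b1, b2 = 2 * x * b1 - b2 + c, b1
    return x * b1 - b2 + coeffs[0]

n = 8
coeffs = np.arange(1.0, n + 1)         # arbitrary coefficients for the example
x = np.linspace(-1, 1, 5)
print(clenshaw_chebyshev(coeffs, x))
print(np.polynomial.chebyshev.chebval(x, coeffs))   # cross-check against NumPy
```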

    A fast randomized algorithm for orthogonal projection

    We describe an algorithm that, given any full-rank matrix A having fewer rows than columns, can rapidly compute the orthogonal projection of any vector onto the null space of A, as well as the orthogonal projection onto the row space of A, provided that both A and its adjoint can be applied rapidly to arbitrary vectors. As an intermediate step, the algorithm solves the overdetermined linear least-squares regression involving the adjoint of A (and so can be used for this, too). The basis of the algorithm is an obvious but numerically unstable scheme; suitable use of a preconditioner yields numerical stability. We generate the preconditioner rapidly via a randomized procedure that succeeds with extremely high probability. In many circumstances, the method can accelerate interior-point methods for convex optimization, such as linear programming (Ming Gu, personal communication).
    Comment: 13 pages, 6 tables
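    As a dense (non-randomized) reference for what the algorithm computes, the sketch below forms both projections via the overdetermined least-squares problem involving the adjoint of A. The paper's contribution is doing this stably and rapidly when A is available only as a fast operator, via a randomized preconditioner; the dense solve here is just the baseline.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 40, 300                       # A has fewer rows than columns
A = rng.standard_normal((m, n))
v = rng.standard_normal(n)

# Solve the overdetermined least-squares problem min_y ||A^T y - v||_2.
y, *_ = np.linalg.lstsq(A.T, v, rcond=None)

row_proj = A.T @ y                   # orthogonal projection of v onto the row space of A
null_proj = v - row_proj             # orthogonal projection of v onto the null space of A

print("||A @ null_proj|| =", np.linalg.norm(A @ null_proj))       # ~ 0
print("<row_proj, null_proj> =", np.dot(row_proj, null_proj))     # ~ 0
```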

    Testing goodness-of-fit for logistic regression

    Explicitly accounting for all applicable independent variables, even when the model being tested does not include them, is critical in testing goodness-of-fit for logistic regression. Doing so can increase statistical power by orders of magnitude.
    Comment: 13 pages, 4 tables
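    A generic illustration of the point (using statsmodels, and not the paper's specific test): fit a logistic model that omits a relevant covariate, then compare it against a model accounting for that covariate via a likelihood-ratio test. The simulated covariates and coefficients below are assumptions made only for this example.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(4)
n = 2000
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)                    # relevant covariate omitted by the reduced model
logit_p = -0.5 + 1.0 * x1 + 1.5 * x2
y = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(float)

X_reduced = sm.add_constant(np.column_stack([x1]))
X_full = sm.add_constant(np.column_stack([x1, x2]))

fit_reduced = sm.Logit(y, X_reduced).fit(disp=0)
fit_full = sm.Logit(y, X_full).fit(disp=0)

# Likelihood-ratio test of the reduced model against the model that
# accounts for the additional independent variable.
lr = 2 * (fit_full.llf - fit_reduced.llf)
print("LR statistic =", lr, " p-value =", chi2.sf(lr, df=1))
```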

    A comparison of the discrete Kolmogorov-Smirnov statistic and the Euclidean distance

    Goodness-of-fit tests gauge whether a given set of observations is consistent (up to expected random fluctuations) with arising as independent and identically distributed (i.i.d.) draws from a user-specified probability distribution known as the "model." The standard gauges involve the discrepancy between the model and the empirical distribution of the observed draws. Some measures of discrepancy are cumulative; others are not. The most popular cumulative measure is the Kolmogorov-Smirnov statistic; when all probability distributions under consideration are discrete, a natural noncumulative measure is the Euclidean distance between the model and the empirical distributions. In the present paper, both mathematical analysis and its illustration via various data sets indicate that the Kolmogorov-Smirnov statistic tends to be more powerful than the Euclidean distance when there is a natural ordering for the values that the draws can take -- that is, when the data is ordinal -- whereas the Euclidean distance is more reliable and more easily understood than the Kolmogorov-Smirnov statistic when there is no natural ordering (or partial order) -- that is, when the data is nominal.
    Comment: 15 pages, 6 figures, 3 tables
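    A minimal sketch of the two discrepancy measures for a discrete model follows (calibration of p-values, typically by simulation under the model, is omitted). Note that the Kolmogorov-Smirnov statistic depends on the ordering of the categories, while the Euclidean distance does not.

```python
import numpy as np

def ks_and_euclid(draws, model_probs):
    """Discrete Kolmogorov-Smirnov statistic and Euclidean distance between
    the empirical distribution of `draws` (integer values 0..k-1) and the
    model probabilities `model_probs`."""
    k = len(model_probs)
    counts = np.bincount(draws, minlength=k)
    empirical = counts / counts.sum()
    # KS uses cumulative distributions, so it depends on the category order.
    ks = np.max(np.abs(np.cumsum(empirical) - np.cumsum(model_probs)))
    # The Euclidean distance is order-free (suitable for nominal data).
    euclid = np.linalg.norm(empirical - model_probs)
    return ks, euclid

rng = np.random.default_rng(5)
model = np.array([0.1, 0.2, 0.3, 0.4])
draws = rng.choice(len(model), size=1000, p=[0.1, 0.25, 0.25, 0.4])
print(ks_and_euclid(draws, model))
```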

    An implementation of a randomized algorithm for principal component analysis

    Recent years have witnessed intense development of randomized methods for low-rank approximation. These methods target principal component analysis (PCA) and the calculation of truncated singular value decompositions (SVD). The present paper presents an essentially black-box, fool-proof implementation for MathWorks' MATLAB, a popular software platform for numerical computation. As illustrated via several tests, the randomized algorithms for low-rank approximation outperform or at least match the classical techniques (such as Lanczos iterations) in basically all respects: accuracy, computational efficiency (both speed and memory usage), ease-of-use, parallelizability, and reliability. However, the classical procedures remain the methods of choice for estimating spectral norms, and are far superior for calculating the least singular values and corresponding singular vectors (or singular subspaces).
    Comment: 13 pages, 4 figures
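    The paper's implementation is in MATLAB; for orientation, here is a generic NumPy sketch of the standard randomized range-finder approach to a truncated SVD (with a few power iterations). It follows the general recipe rather than the paper's specific code, and the oversampling and iteration counts are illustrative defaults.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_iter=2, rng=None):
    """Rank-k truncated SVD via a randomized range finder with power
    iterations (a generic sketch, not the paper's MATLAB implementation)."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    G = rng.standard_normal((n, k + oversample))
    Y = A @ G                                    # sample the range of A
    for _ in range(n_iter):                      # power iterations sharpen accuracy;
        Y = A @ (A.T @ Y)                        # production codes re-orthonormalize here
    Q, _ = np.linalg.qr(Y)                       # orthonormal basis for the sampled range
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], s[:k], Vt[:k]

A = np.random.default_rng(6).standard_normal((500, 200))
U, s, Vt = randomized_svd(A, k=10)
print("top singular values:", np.round(s, 3))
```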

    Cumulative deviation of a subpopulation from the full population

    Assessing equity in treatment of a subpopulation often involves assigning numerical "scores" to all individuals in the full population such that similar individuals get similar scores; matching via propensity scores or appropriate covariates is common, for example. Given such scores, individuals with similar scores may or may not attain similar outcomes independent of the individuals' memberships in the subpopulation. The traditional graphical methods for visualizing inequities are known as "reliability diagrams" or "calibration plots," which bin the scores into a partition of all possible values, and for each bin plot both the average outcome for individuals in the subpopulation and the average outcome for all individuals; comparing the graph for the subpopulation with that for the full population gives some sense of how the averages for the subpopulation deviate from the averages for the full population. Unfortunately, real data sets contain only finitely many observations, limiting the usable resolution of the bins, and so the conventional methods can obscure important variations due to the binning. Fortunately, plotting cumulative deviation of the subpopulation from the full population as proposed in this paper sidesteps the problematic coarse binning. The cumulative plots encode subpopulation deviation directly as the slopes of secant lines for the graphs. Slope is easy to perceive even when the constant offsets of the secant lines are irrelevant. The cumulative approach avoids binning that smooths over deviations of the subpopulation from the full population. Such cumulative aggregation furnishes both high-resolution graphical methods and simple scalar summary statistics (analogous to those of Kuiper and of Kolmogorov and Smirnov used in statistical significance testing for comparing probability distributions).
    Comment: 70 pages, 51 figures, 2 tables; the new versions of the paper merge in most of arXiv:2006.0250
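    A simplified sketch of the cumulative-deviation construction, assuming discrete scores so that each subpopulation member can be matched exactly to the full-population average at the same score (the paper's treatment is more general); the Kolmogorov-Smirnov- and Kuiper-style summaries follow the same idea as the paper's scalar statistics, and the simulated data are assumptions for the example.

```python
import numpy as np

def cumulative_deviation(scores, outcomes, in_subpop):
    """Cumulative deviation of a subpopulation from the full population,
    assuming discrete scores so exact score matching is possible
    (a simplified sketch, not the paper's reference implementation)."""
    scores = np.asarray(scores)
    outcomes = np.asarray(outcomes, dtype=float)
    in_subpop = np.asarray(in_subpop, dtype=bool)

    # Average outcome over the full population at each distinct score.
    full_mean = {s: outcomes[scores == s].mean() for s in np.unique(scores)}

    # Members of the subpopulation, ordered by score.
    order = np.argsort(scores[in_subpop], kind="stable")
    sub_scores = scores[in_subpop][order]
    sub_outcomes = outcomes[in_subpop][order]

    # Cumulative average of (subpopulation outcome - matched full-population mean);
    # deviation over a score range shows up as the slope of this curve there.
    deviations = sub_outcomes - np.array([full_mean[s] for s in sub_scores])
    cumulative = np.cumsum(deviations) / len(deviations)

    ks_like = np.abs(cumulative).max()                   # Kolmogorov-Smirnov-style summary
    kuiper_like = cumulative.max() - cumulative.min()    # Kuiper-style summary
    return cumulative, ks_like, kuiper_like

rng = np.random.default_rng(7)
scores = rng.integers(0, 10, size=5000)
in_subpop = rng.random(5000) < 0.3
outcomes = (rng.random(5000) < 0.1 * scores + 0.05 * in_subpop).astype(float)
_, ks_like, kuiper_like = cumulative_deviation(scores, outcomes, in_subpop)
print("KS-like:", ks_like, " Kuiper-like:", kuiper_like)
```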